This analysis aims to provide insight into car crashes in New York from 2016 to 2022, focusing on key factors such as vehicle type, hour of the day, and weather.
The analysis uses two datasets: one containing the vehicle crashes and one containing the weather at the location.
| Name | Rows | Columns | Each row is a | Link |
|---|---|---|---|---|
| Vehicles dataset | 2.11M | 29 | Motor Vehicle Collision | Vehicles dataset |
| Weather dataset | 59,760 | 10 | Time stamp of Weather | Weather dataset |
This step shows how the data was collected from the two datasets, cleaned, and merged into a single comprehensive dataset for analysis. Before merging, the data was cleaned and enriched with additional information, and it was reduced in size to make it more manageable.
The data cleaning process involves removing columns with high NA ratios, filtering out rows with missing values, and creating new columns to categorize the main causes of accidents. The data is then enriched with additional information such as the day of the week, month, quarter, year, and time of day.
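The enrichment step can be sketched in base R as follows. This is a minimal illustration rather than the project's actual code: the column name `crash_time` and the sample timestamps are assumptions, and weekday labels are derived from ISO weekday numbers so that the result does not depend on the system locale.

```r
# Hypothetical sample: the column name `crash_time` is an assumption.
crashes <- data.frame(
  crash_time = as.POSIXct(
    c("2021-01-04 08:15:00", "2021-06-18 17:40:00", "2021-11-07 23:05:00"),
    tz = "UTC"
  )
)

# ISO weekday numbers (1 = Monday ... 7 = Sunday) keep labels locale-independent.
day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Derive the calendar columns from the timestamp.
crashes$weekday <- day_labels[as.integer(format(crashes$crash_time, "%u"))]
crashes$month   <- month.name[as.integer(format(crashes$crash_time, "%m"))]
crashes$quarter <- quarters(crashes$crash_time)
crashes$year    <- as.integer(format(crashes$crash_time, "%Y"))
crashes$hour    <- as.integer(format(crashes$crash_time, "%H"))

print(crashes)
```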
Both datasets, especially vehicles.csv, contain a large number of
rows, which require a lot of memory and time to process. For this reason,
rows with missing values and columns with high NA ratios were removed.
The dropped columns include vehicle type 3, 4 and 5, which are only
filled for multiple vehicle accidents and therefore hold very few values.
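A minimal sketch of this cleaning step is given here, assuming a cut-off of 50% missing values per column. The threshold, the column names and the sample rows are all hypothetical; the report does not state which NA ratio was actually used.

```r
# Hypothetical sample; column names are illustrative only.
vehicles <- data.frame(
  crash_date     = c("2021-01-04", "2021-01-05", NA, "2021-01-07"),
  vehicle_type_1 = c("Sedan", "Taxi", "Bus", "Sedan"),
  vehicle_type_3 = c(NA, NA, NA, "Bike"),
  stringsAsFactors = FALSE
)

na_threshold <- 0.5  # assumed cut-off, not taken from the report

# Share of missing values per column.
na_ratio <- colMeans(is.na(vehicles))

# Step 1: drop columns whose NA ratio exceeds the threshold.
vehicles_clean <- vehicles[, na_ratio <= na_threshold, drop = FALSE]

# Step 2: drop rows that still contain a missing value.
vehicles_clean <- vehicles_clean[complete.cases(vehicles_clean), , drop = FALSE]

print(vehicles_clean)
```

Dropping the sparse columns first matters: filtering rows first would discard almost every row, since the sparse columns are empty for most accidents.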
Weather.xlsx is a smaller dataset and its cleaning process is simpler:
the time column only needs to be converted to a proper date-time format,
and some columns are renamed for better understanding.
Finally, the data is then merged with the weather data to create a
comprehensive dataset for analysis.
The result is:
| Name | Rows | Columns | Each row is a |
|---|---|---|---|
| Merged dataset | 1M | 40 | Combination of vehicle and weather data |
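One way such a merge could look is sketched here. It assumes that the weather data has one row per hour and that each crash is matched to the weather of the hour in which it happened; the key `hour_key` and all column names are hypothetical.

```r
# Hypothetical samples; all column names are assumptions.
crashes <- data.frame(
  crash_time = as.POSIXct(c("2021-01-04 08:15:00", "2021-01-04 09:50:00"),
                          tz = "UTC"),
  borough    = c("BROOKLYN", "QUEENS")
)
weather <- data.frame(
  time    = as.POSIXct(c("2021-01-04 08:00:00", "2021-01-04 09:00:00"),
                       tz = "UTC"),
  rain_mm = c(0.0, 1.2)
)

# Build a shared text key that identifies the hour of each record.
crashes$hour_key <- format(crashes$crash_time, "%Y-%m-%d %H:00")
weather$hour_key <- format(weather$time, "%Y-%m-%d %H:00")

# Left join: keep every crash, attach the weather of the same hour.
merged <- merge(crashes, weather[, c("hour_key", "rain_mm")],
                by = "hour_key", all.x = TRUE)

print(merged)
```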
After merging, the data for each year is saved in a separate file to make it easier to analyse the data by year. This makes it possible to run analyses both on the full dataset and on individual years.
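Splitting by year could be done along these lines. The file naming scheme `merged_<year>.csv` and the use of a temporary directory are assumptions made for the sake of a self-contained example.

```r
# Hypothetical merged data with a `year` column.
merged <- data.frame(
  year    = c(2020L, 2021L, 2021L),
  borough = c("BRONX", "QUEENS", "BROOKLYN")
)

out_dir <- tempdir()  # placeholder output folder

# split() returns one data frame per year, named by the year.
by_year <- split(merged, merged$year)

for (yr in names(by_year)) {
  write.csv(by_year[[yr]],
            file.path(out_dir, paste0("merged_", yr, ".csv")),
            row.names = FALSE)
}

print(list.files(out_dir, pattern = "^merged_"))
```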
The following subsections illustrate the insights gained by generating plots of the given datasets.
New York City is known as the city that never sleeps. Several thousand
people are on the road at any time, which leads to traffic jams and car
crashes. The first step was therefore to get a broad understanding of
the rough numbers. For this reason, the first plot shows how many
accidents there were in the last few years.
This graph shows the total number of accidents per year in New York from 2016 to 2022. The number of accidents appears to decrease over the years, which is a positive trend.
It is important to note that no conclusions can be drawn from this graph alone, as other factors may have influenced the number of accidents, such as the COVID-19 pandemic and the lockdowns that occurred in 2020 and 2021, which could have reduced the number of vehicles on the road and, consequently, the number of accidents.
## [1] "Correlation coefficient: 0.59"
This graph shows the correlation between the total number of
accidents and the total rainfall per month in New York in 2021. The
result of 0.59 indicates a moderate positive correlation
between the two variables, suggesting that months with more rainfall
tend to have more accidents. A correlation does not establish a cause,
however: multiple other factors may influence the result, such as how
many cars and people are on the road, events, or other environmental
conditions.
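A coefficient of this kind can be computed with `cor()`. The monthly values in this sketch are invented for illustration and will not reproduce the 0.59 reported above. The choice of the Pearson method is also an assumption; it is the default of `cor()`, but the report does not say which method was used.

```r
# Hypothetical monthly totals, invented for illustration only.
monthly <- data.frame(
  month     = month.abb,
  accidents = c(7200, 6800, 8100, 8400, 9300, 9800,
                9500, 9100, 9000, 9400, 8700, 8300),
  rain_mm   = c(60, 110, 85, 70, 115, 80,
                280, 250, 260, 130, 30, 35)
)

# Pearson correlation between monthly accidents and monthly rainfall.
r <- cor(monthly$accidents, monthly$rain_mm, method = "pearson")

print(paste("Correlation coefficient:", round(r, 2)))
```

With only twelve monthly observations, a single unusual month can shift the coefficient noticeably, which is another reason to read the value with care.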
Monthly Number of Accidents by Rainfall Category with Monthly Rainfall
This graph shows the monthly number of accidents in New York in 2021, categorised by rainfall intensity. The black line represents the total number of accidents per month, while the blue bars represent the total rainfall per month on a secondary y-axis. The dots represent the number of accidents in each rainfall category.
The number of accidents tends to increase with higher rainfall, especially in the “1 mm - 4 mm (Light rain)” and “>4 mm - 7 mm (Moderate rain)” categories.
At the same time, the majority of accidents occur when there is no rain, which could simply be because most of the time it does not rain in New York. Hence, without the total number of vehicles on the road, it is difficult to draw conclusions from this data alone.
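The rainfall categories could be assigned as in this sketch. Only the light and moderate labels appear in the report; the "no rain" and "heavy rain" labels and the 1 mm lower bound for rain are assumptions, and `categorise_rain` is a hypothetical helper name.

```r
# Assign a rainfall category to each measurement (in mm).
# Only the light and moderate labels come from the report;
# the first and last category are assumed.
categorise_rain <- function(mm) {
  labels <- c("<1 mm (No rain)",
              "1 mm - 4 mm (Light rain)",
              ">4 mm - 7 mm (Moderate rain)",
              ">7 mm (Heavy rain)")
  idx <- ifelse(mm < 1, 1L,
         ifelse(mm <= 4, 2L,
         ifelse(mm <= 7, 3L, 4L)))
  factor(labels[idx], levels = labels)
}

rain_mm <- c(0, 0.4, 1, 3.5, 4, 6.2, 12)
rain_category <- categorise_rain(rain_mm)

# Number of observations per category.
print(table(rain_category))
```

Nested `ifelse()` calls are used instead of `cut()` because the light category includes both of its bounds (1 mm and 4 mm), which a single `cut()` call with uniform interval closure cannot express.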
Here the goal is to analyse the distribution of accidents by hour, day of the week, and month.
Hopefully, this analysis provides insights into the temporal patterns of accidents in New York, helping to identify high-risk and safer periods.
The hourly analysis shows accidents in New York across the 24 hours of the day. Accidents never dip below one thousand at any hour. They rise quickly from around 4AM to 8AM, after which the rise slows until 4PM to 5PM, the peak hours for accidents. These hours reflect the movement in the city: from 7AM, inhabitants travel to work, school, etc. The maximum around 5PM coincides with the “rush hour”, when most workers end their shift and commute home.
After 6PM, accidents decline slowly through the evening and night until the next morning.
This plot is closely related to the previous one: it shows the same information in a different way, grouping the hours into four categories: Morning, Afternoon, Late Afternoon and Night.
The graph illustrates that the likelihood of accidents in New York varies significantly depending on the time of day. The highest number of accidents occurs in the afternoon, followed by the evening. This trend suggests that increased traffic during these times, possibly due to people commuting home from work or school, contributes to a higher accident rate.
In contrast, the morning sees fewer accidents, which might be attributed to lighter traffic or more cautious driving as people start their day. The night has the lowest number of accidents, likely due to reduced traffic volume and fewer vehicles on the road.
Overall, the graph highlights the importance of being particularly cautious during the afternoon and evening when the risk of accidents is highest.
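The grouping of hours into the four categories can be sketched with a lookup vector. The hour boundaries used here are assumptions, since the report does not state where each category starts and ends.

```r
# One label per hour of the day (index 1 = hour 0, index 24 = hour 23).
# The boundaries are assumed, not taken from the report.
time_of_day_for_hour <- c(
  rep("Night", 6),           # 00:00 - 05:59
  rep("Morning", 6),         # 06:00 - 11:59
  rep("Afternoon", 5),       # 12:00 - 16:59
  rep("Late Afternoon", 4),  # 17:00 - 20:59
  rep("Night", 3)            # 21:00 - 23:59
)

hours <- c(0, 8, 12, 17, 23)

# Hours start at 0, R indexing starts at 1.
time_of_day <- time_of_day_for_hour[hours + 1]

print(data.frame(hour = hours, time_of_day = time_of_day))
```

A lookup vector handles the fact that "Night" wraps around midnight, which a single set of increasing break points would not.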
On a daily basis, the differences between days can be large.
From the start of the week until Wednesday, the number of accidents is stable at around 9,500. It then grows slightly until Thursday. On Friday, the last working day, accidents rise steeply by about 10%, which could be linked to many factors, including that Friday is the day when most people go out or leave New York for the weekend (e.g. visiting family, vacation…).
On the weekend, perhaps counterintuitively, there is a substantial drop in accidents in the area. This may be due to the reason mentioned above, people leaving the area, or to the opposite: inhabitants travelling less by vehicle, staying at home or doing activities near their residence. The drop continues on Sunday, which could be related to religious activities as well as possible local policies on certain activities.
The graph shows the monthly distribution of accidents in New York, highlighting clear trends throughout the year:
Low Accident Rates (January-February): Accidents start at moderate levels in January and dip to their lowest in February, likely due to harsh winter weather reducing travel.
Rising Accident Rates (March-June): Accidents increase sharply from March, peaking in June. This rise may be due to better weather, leading to more travel and increased road traffic.
Plateau and Decline (July-November): After a slight drop in July, accident numbers steadily decline from August to November, possibly due to the transition to fall and reduced travel as the year progresses.
Significant Drop (December): December sees a notable decrease in accidents, similar to January, possibly due to holiday periods and the onset of winter.
Accidents in New York peak in mid-year (April-June) and drop during winter months. Understanding these trends can help target road safety measures and reduce accidents during high-risk periods.
There are several factors that can contribute to car accidents, such as weather conditions, vehicle types, and the reported reason that caused the accident.
Some of these factors range from traffic rule violations to illegal actions such as driving under the influence of alcohol or other impairing drugs. They can be divided into categories such as human error, mechanical error, environmental conditions, and medical conditions.
Note that the data comes from real accidents, reported by the victims themselves. This means that the data is complex and often inaccurate. After cleaning and filtering, a large share of the reports could not be categorized properly, so an extra category was created for them.
This plot shows how the top 10 contributing factors to accidents are distributed. The most common factors are “Driver Inattention/Distraction”, “Failure to Yield Right-of-Way”, and “Following Too Closely”. These factors are often related to human error, which is a common cause of accidents.
“Unspecified” is the biggest category in this plot, which shows how many of the reports lack such an important detail. This raises the question of how seriously the parties involved, along with law enforcement and insurance companies, take this matter.
It can be assumed that these “accidents” caused only minor damage, probably with no injuries, hence the low importance given to the report.
After categorizing the contributing factors, Human error is the most common cause of accidents, followed by Unspecified, then Environmental Conditions and Mechanical Error. At first glance, the previous plot suggests that minor accidents of unknown nature are the most common, followed by “Distraction of the driver”. Once all human error causes are aggregated, however, they form the biggest category.
This shows that most accidents are caused by human mistakes, such as inattention, distraction, or failure to yield right-of-way.
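The aggregation into categories can be sketched with a named lookup vector. The mapping shown here is illustrative and far from complete: the three human error factors are the ones named in the text, while the remaining entries and the sample reports are assumptions. Any factor without a mapping falls into the extra "Unspecified" category.

```r
# Illustrative, incomplete mapping from reported factor to category.
factor_category <- c(
  "Driver Inattention/Distraction" = "Human error",
  "Failure to Yield Right-of-Way"  = "Human error",
  "Following Too Closely"          = "Human error",
  "Brakes Defective"               = "Mechanical error",         # assumed
  "Pavement Slippery"              = "Environmental conditions", # assumed
  "Illness"                        = "Medical conditions"        # assumed
)

# Hypothetical reported factors, including one that has no mapping.
reported <- c("Following Too Closely", "Unspecified", "Brakes Defective",
              "Driver Inattention/Distraction", "Some unmapped factor")

# Look up each factor; unknown names yield NA.
category <- unname(factor_category[reported])

# Everything that could not be categorized goes into the extra category.
category[is.na(category)] <- "Unspecified"

print(sort(table(category), decreasing = TRUE))
```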
The graph highlights the top 10 vehicle types involved in accidents in New York, offering insight into trends in vehicular safety. Sedans lead with over 31,000 incidents, likely due to their widespread use. SUVs and station wagons follow with more than 23,000 accidents, reflecting their popularity and the challenges of operating larger vehicles in urban settings.
Commercial vehicles such as taxis, pick-up trucks, and box trucks also account for a significant number of accidents. These vehicles are crucial to the city’s transportation system, and their involvement in accidents points to the risks associated with heavy traffic use. Buses, integral to public transport, appear on the list as well, showing that even professional drivers face difficulties in dense traffic.
Smaller vehicles like bikes, e-bikes, and motorcycles are less frequently involved in accidents but still make the top 10, highlighting the vulnerabilities of these road users. Ambulances, while less involved, are also on the list, possibly due to the nature of their high-speed operations.
The time series graph shows the daily count of accidents in New York over the course of 2021, with a visible trend line that highlights changes over time. Early in the year, there is a gradual increase in the number of accidents, which peaks around mid-year. This rise may reflect seasonal factors, such as increased travel and activity during warmer months.
As the year progresses into the latter half, the trend line begins to slope downward, indicating a decrease in accident frequency. This reduction could be associated with factors like reduced daylight hours in fall and winter, weather conditions, or changing traffic patterns as the year ends.
The day-to-day variability in accident counts, represented by the blue line, suggests that while there are general trends, daily fluctuations are significant, potentially driven by short-term factors like weather, events, or specific traffic incidents.
Overall, the graph provides insights into how accident rates evolved throughout the year, with clear periods of increase and decline, helping to identify patterns that could inform traffic safety measures and resource allocation.
The graph reveals several insights about the relationship between the number of accidents and injury rates across different boroughs in New York City.
The Bronx stands out with a relatively large dot, indicating a high number of accidents. Its position on the graph suggests that the injury rate is on the higher side compared to other boroughs, which could point to more severe traffic conditions or less effective safety measures.
Brooklyn also has a significant number of accidents, as indicated by its large dot. However, its injury rate is slightly lower than that of the Bronx. This difference might suggest that Brooklyn has better safety measures in place or different traffic conditions that result in fewer injuries per accident.
Manhattan’s dot is smaller than those for the Bronx and Brooklyn, indicating fewer accidents overall. The injury rate for Manhattan is moderate, suggesting that while there are fewer accidents, the conditions or severity of these accidents might differ from those in other boroughs.
Queens, similar to Brooklyn, has a large dot, indicating a high number of accidents. The injury rate in Queens is comparable to Brooklyn, which might imply similar traffic conditions or safety measures in place.
Staten Island, on the other hand, has the smallest dot, indicating the fewest accidents among the boroughs. Its injury rate is also the lowest, suggesting that Staten Island might benefit from better traffic safety measures or less congested roads, leading to fewer and less severe accidents.
Overall, the graph provides a clear comparison of the injury rates and total number of accidents across different boroughs, highlighting areas where traffic safety could potentially be improved.
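The two quantities compared per borough can be derived as in this sketch. It assumes that the injury rate is the share of accidents with at least one injured person; the report does not define the rate, and the column names and sample rows are hypothetical.

```r
# Hypothetical sample; column names are assumptions.
crashes <- data.frame(
  borough         = c("BRONX", "BRONX", "BROOKLYN", "BROOKLYN",
                      "BROOKLYN", "STATEN ISLAND"),
  persons_injured = c(1, 0, 0, 2, 0, 0)
)

# Group the injury counts by borough.
by_borough <- split(crashes$persons_injured, crashes$borough)

summary_by_borough <- data.frame(
  borough     = names(by_borough),
  # Total number of accidents (the dot size in the plot).
  accidents   = vapply(by_borough, length, integer(1)),
  # Assumed definition: share of accidents with at least one injury.
  injury_rate = vapply(by_borough, function(x) mean(x > 0), numeric(1)),
  row.names   = NULL
)

print(summary_by_borough)
```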
The plots above are a further step towards a deeper analysis of when accidents happen in New York. They show three radar charts that compare the occurrence of accidents by weekday across three different categories: accidents with injuries, accidents with deaths, and accidents without injuries or deaths.
In the first chart, shaded in blue, accidents with injuries are displayed. The chart indicates that such accidents are fairly evenly distributed throughout the week, with a slight increase towards the end of the week, particularly on Friday and Saturday. There is a noticeable decrease on Monday, suggesting that fewer injury-related accidents occur at the start of the week.
The middle chart, shaded in red, focuses on accidents that result in deaths. This chart shows a clear concentration of deadly accidents towards the end of the week, especially on Friday and Saturday. In contrast, the beginning of the week, particularly Monday and Tuesday, sees fewer accidents with fatalities.
The third chart, shaded in green, represents accidents that do not result in injuries or deaths. These accidents appear to be more evenly spread throughout the week, with a slight increase during the middle of the week, particularly on Wednesday. The frequency of these accidents is lower on Sunday and Monday.
Overall, the charts reveal that accidents involving injuries and deaths tend to increase towards the weekend, while accidents without injuries or deaths occur more consistently throughout the week, with minor variations. The visual representation highlights the differences in accident patterns across the days of the week, with distinct trends for each type of accident.
With this information, the team decided to dig one layer deeper to find out at what time each type of accident mostly happens.
Building on the previous plot, the team now focused on the time of day at which accidents occurred in 2021.
In the first heatmap, titled “Accidents with Injuries,” the data is shown for Thursday, Friday, and Saturday across different times of the day. The color intensity indicates the number of accidents, with darker reds representing higher numbers. The highest concentration of accidents with injuries occurs on Friday afternoons, as indicated by the darkest red square. This suggests that the risk of accidents causing injuries peaks during this time.
The second heatmap, titled “Accidents with Deaths,” similarly shows data for Thursday, Friday, and Saturday. The most intense red, representing the highest number of fatal accidents, is observed on Thursday mornings and Saturday early mornings. This indicates that fatal accidents are more likely to occur during these times. The distribution is more varied across the different times of day and days of the week compared to accidents with injuries.
The third heatmap, titled “Accidents without Injuries or Deaths,” also covers the same days of the week and times of day. Here, the darkest red, indicating the highest number of non-injury, non-fatal accidents, is again centered on Friday afternoons. This suggests that, similar to accidents with injuries, non-injury accidents also peak in the afternoon on Fridays.
In summary, all three heatmaps reveal patterns in the timing and frequency of different types of accidents. Accidents, whether resulting in injuries, deaths, or neither, tend to cluster around specific times of the day, particularly Friday afternoons. However, fatal accidents are more likely to occur early in the day on Thursdays and Saturdays.
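The counts behind such a heatmap can be obtained with a cross-tabulation of weekday and time of day. This sketch covers the injury heatmap only, and the column names and sample rows are hypothetical.

```r
# Hypothetical sample; column names are assumptions.
crashes <- data.frame(
  weekday         = c("Friday", "Friday", "Thursday", "Saturday", "Friday"),
  time_of_day     = c("Afternoon", "Afternoon", "Morning", "Night", "Morning"),
  persons_injured = c(1, 2, 0, 1, 0)
)

# Keep only accidents with at least one injured person.
injury_crashes <- crashes[crashes$persons_injured > 0, ]

# Fixing the factor levels keeps empty cells in the table (count 0).
counts <- table(
  weekday = factor(injury_crashes$weekday,
                   levels = c("Thursday", "Friday", "Saturday")),
  time_of_day = factor(injury_crashes$time_of_day,
                       levels = c("Morning", "Afternoon",
                                  "Late Afternoon", "Night"))
)

# Long format with one row per cell, as plotting functions usually expect.
heatmap_data <- as.data.frame(counts, responseName = "accidents")

print(counts)
```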
Esquisse is a package that allows you to create interactive plots and dashboards in R. It is similar to Tableau in that it provides a user-friendly interface for creating visualizations without writing code. After loading your data, you can create plots interactively by dragging and dropping variables.
To use Esquisse, install the package and load it in your R script.
After loading the package, launch the web app by calling the
esquisser() function with your data as an argument. This opens the
Esquisse interface in a web browser. Esquisse works with the plotly
package to create interactive plots.
NOTE: Loading the data was tricky. Esquisse would not load the merged dataset when it was passed from the code. Opening Esquisse without an argument started the app and prompted to load the data from the interface instead.
When loading the data from the interface, a warning message in the UI finally revealed the problem: loading more than 5MB of data was not possible.
This is a limitation of Shiny, which is used to build the Esquisse interface. The maximum request size for Shiny apps is 5MB by default, which is not enough for our dataset. It could have been a deal breaker for using Esquisse in this project, or in any other project with a large dataset.
To solve this issue, it is necessary to increase the maximum request
size for Shiny apps by setting the shiny.maxRequestSize option via
options() to a higher value. In this case it is set to 300MB, which
should be enough for the merged dataset.
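In code, the setting looks roughly like this. The option expects a value in bytes, so 300MB is written as a product. The launch call is left commented out because it needs the esquisse package and an interactive session, and `merged_data` is a hypothetical object name.

```r
# Raise the Shiny upload limit from the 5 MB default to 300 MB.
# The option is given in bytes.
options(shiny.maxRequestSize = 300 * 1024^2)

# Check the value that is now active.
print(getOption("shiny.maxRequestSize"))

# Afterwards Esquisse can be launched as usual, for example:
# install.packages("esquisse")        # one-time installation
# esquisse::esquisser(merged_data)    # `merged_data` is a placeholder name
```

The option has to be set before the app is launched, in the same R session.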
At the bottom left of the pane, in the options tab, you can select to
make the plots with plotly to make them interactive. Once
active, you can hover over the plots to see the data points and values
or click on the legend to filter the data.
In this plot, the number of accidents with injuries is shown by the hour of the day. The plot reveals that after 10 AM, the number of accidents with injuries increases rapidly, peaking by midnight.
This can be attributed to various factors, such as increased traffic during the day, driver fatigue, and reduced visibility at night. Also as shown in the categories before, it could be related to alcohol consumption or other drugs, which are more likely to be consumed during the night.
This plot shows the number of accidents with deaths by the hour of the day. The plot indicates that the number of fatal accidents is relatively low during the day, with a slight increase in the evening. However, the number of fatal accidents rises significantly after midnight, peaking in the early morning hours.
This trend suggests that fatal accidents are more likely to occur during the night and early morning, again, possibly due to factors such as reduced visibility, driver fatigue, increased risk-taking behavior, and impaired driving due to alcohol or drugs.
These two plots are important for understanding the risks of accidents at different times of the day and when accidents are more likely to result in injuries or fatalities. This information can help inform traffic safety measures and interventions to reduce the number of accidents and improve road safety.
Esquisse is a powerful tool for creating interactive plots and dashboards in R. It provides a user-friendly interface that allows you to create visualizations without writing code. The drag-and-drop functionality makes it easy to explore your data and create custom plots quickly.
For users familiar with Tableau or other data visualization tools, Esquisse offers a similar experience in R. You can create a wide range of plots, including scatter plots, bar charts, line graphs, and more. The interactive features allow you to explore your data in depth and gain insights into patterns and trends.
Esquisse is a valuable tool for data analysis, exploratory data visualization, and sharing insights with others. It is especially useful for users who prefer a visual interface for creating plots and dashboards. With Esquisse, you can create professional-looking visualizations that enhance your data analysis and storytelling.
Nevertheless, it comes with some limitations when compared with other software. It might not be as intuitive or powerful as Tableau, and it may not offer the same level of customization or advanced features. Also, the size of the dataset is limited by the maximum request size for Shiny apps, which can be a constraint for large datasets and has to be changed manually.
Overall, Esquisse is a valuable tool for creating interactive plots and dashboards in R, and it can be a great addition to your data analysis toolkit.
Below is a simple Shiny app in which users can select a date range (by default the first and last date of the loaded dataset) and the weekday(s), and specify whether only accidents with deaths and/or injured people should be included. Depending on the selection, the plot is updated automatically, giving the user direct feedback for a very basic analysis of how many accidents happen on specific weekdays within the date range.
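The filtering logic behind such an app can be sketched as a plain R function, which in a Shiny app would typically be called inside a reactive expression with the current input values. The function name `filter_crashes`, its arguments and the column names are hypothetical. The sketch also assumes that ticking both options keeps only accidents with both deaths and injuries, which may differ from how the actual app combines them.

```r
# Hypothetical helper: returns the rows that match the user's selection.
filter_crashes <- function(data, date_from, date_to, weekdays_sel,
                           only_deaths = FALSE, only_injuries = FALSE) {
  keep <- data$date >= date_from & data$date <= date_to &
    data$weekday %in% weekdays_sel
  # Assumed behaviour: each ticked option narrows the selection further.
  if (only_deaths)   keep <- keep & data$persons_killed > 0
  if (only_injuries) keep <- keep & data$persons_injured > 0
  data[keep, , drop = FALSE]
}

# Hypothetical sample data.
crashes <- data.frame(
  date            = as.Date(c("2021-01-04", "2021-01-08", "2021-02-12")),
  weekday         = c("Monday", "Friday", "Friday"),
  persons_injured = c(0, 1, 0),
  persons_killed  = c(0, 0, 1)
)

# All Friday accidents in 2021.
fridays <- filter_crashes(crashes, as.Date("2021-01-01"),
                          as.Date("2021-12-31"), "Friday")

# Friday accidents with at least one death.
fatal_fridays <- filter_crashes(crashes, as.Date("2021-01-01"),
                                as.Date("2021-12-31"), "Friday",
                                only_deaths = TRUE)

print(table(fridays$weekday))
```

Keeping the filter in a standalone function makes it possible to check the selection logic without starting the app.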
It is important to clarify that Esquisse, mentioned above, is itself a Shiny application, one tailored to letting the user set up plots in the UI by clicking and dragging variables. In the Shiny app shown here, by contrast, the setup is done by the developer, and the user can only select the variables to be shown. More functionality can be added to the Shiny app, but it requires more coding. In conclusion, working with Shiny was more pleasant than expected, as the user interface can be created quite easily without worrying too much about design and user experience. With the experience gained, both team members have seen its potential and would consider using it in a future project during their studies, where applicable.
This section showcases an interactive map of accidents in New York
using the plotly package. The map shows the density of
accidents based on latitude and longitude coordinates, with color
indicating the density of accidents in each area. It is an interactive
map that allows you to zoom in and out and hover over data points to see
more information.
At first glance, it can look as if the whole of New York is yellow, but zooming in reveals the density of accidents in each area. This is to be expected, given the high number of inhabitants and accidents in New York.
It is interesting to see that accidents are concentrated in certain areas, such as Manhattan and Brooklyn, which are more densely populated and have more traffic. There is also a visible trend that horizontal (East-West) roads have more accidents than vertical (North-South) roads.
Also, as expected, accidents are concentrated on main roads and highways, where traffic is heavier, until a bridge is reached. On the map, virtually all bridges are free of accidents, which is a good sign for traffic safety.
When we first set out on our data science project, we were excited but also a bit overwhelmed. We knew we were dealing with a complex problem—massive datasets, intricate relationships, and the need to create meaningful visualizations that could reveal the insights hidden within the data. As much as we loved diving into R and exploring the depths of data manipulation, we also knew that the journey ahead would be challenging and time-consuming. That’s when we decided to bring ChatGPT into our process.
Initially, we approached ChatGPT with cautious optimism. Could an AI really help us navigate the complexities of data science? Could it generate R code that would actually work? As it turned out, the answer was yes. Whether we were stuck on which plot would best represent our data or needed help writing a tricky piece of R code, ChatGPT was there, offering suggestions that we hadn’t even thought of. It became our go-to resource, a kind of brainstorming partner that was always available and always ready with ideas.
For example, when we were trying to figure out how to visualize our data, ChatGPT suggested options like heatmaps, radar charts, and density plots. Not only did it give us the idea, but it also provided the R code to create these visualizations. This saved us countless hours that we would have otherwise spent tweaking and troubleshooting code. When we encountered errors—inevitable in any data science project—ChatGPT was incredibly helpful in diagnosing the issues and suggesting fixes. It was like having a personal R consultant available at all times.
However, as much as ChatGPT boosted our productivity, it also taught us an important lesson: even with AI, the human touch is irreplaceable. We quickly learned that while ChatGPT could provide code and suggest visualizations, these outputs needed to be rigorously checked. Sometimes, the code didn’t produce exactly what we wanted, or the visualization didn’t tell the story we were trying to convey. We had to step back, critically evaluate the results, and make adjustments to ensure that our analysis was accurate and meaningful.
Another interesting challenge we faced was the impact of having so much information and functionality at our fingertips. Using ChatGPT made our process incredibly efficient, but it also shortened the time we usually spent brainstorming and exploring different approaches. There’s something to be said for the creative process that comes from struggling with a problem and experimenting with various solutions. With ChatGPT, we sometimes found that this creative journey was cut short, as solutions were so readily available. It was a trade-off between efficiency and the creative exploration that often leads to unexpected insights in data science.
Yet, in today’s data-driven world, where the ability to quickly analyze and interpret data is crucial, using AI tools like ChatGPT has become almost essential. The efficiency gains we experienced were undeniable. We could spend less time wrestling with the syntax of R and more time focusing on the bigger picture—interpreting the data, refining our analyses, and communicating our findings. ChatGPT didn’t take away our ability to think critically or creatively; instead, it enhanced our ability to execute on those thoughts more quickly and with greater precision.
Reflecting on our experience, it is clear that using ChatGPT was like having a super quick assistant in our data science toolkit. It did not replace our skills in R or our understanding of data, but it did significantly streamline our workflow. We were able to focus more on the insights and less on the mechanics, which made our project not only faster but also richer in the insights we could uncover.
In conclusion, while tools like ChatGPT are reshaping the way we approach data science, they also remind us of the importance of maintaining our critical thinking and creativity. AI can accelerate the process, but it is still up to us to ensure that the results are accurate, meaningful, and aligned with our goals. In the end, the combination of human insight and AI-driven efficiency is what made our project a success, and it is a balance we will continue to seek in future endeavors.
In this project, we explored a dataset of traffic accidents in New York City from 2016 to 2022, focusing on the data from 2021. We analyzed various aspects of the accidents, including the time, location, contributing factors, vehicle types, and injuries or deaths. Through data visualization and analysis, we gained insights into the patterns and trends of accidents in New York, helping to identify areas for traffic safety improvement.
We found that human error is the most common cause of accidents, with factors like inattention, distraction, and failure to yield right-of-way contributing to a significant number of incidents. Sedans and SUVs are the most common vehicle types involved in accidents, reflecting their widespread use in the city. Accidents are more likely to occur during the day, with a peak in the afternoon and evening hours.
By analyzing the data by borough, we identified differences in accident rates and injury rates across different areas of New York City. The Bronx and Brooklyn have higher accident rates and injury rates, while Staten Island has the lowest rates. These insights can inform targeted traffic safety measures and interventions to reduce accidents and improve road safety.